This assignment is designed to help you get started on the final project. Be sure to review the final project instructions (https://edav.info/project.html), in particular the new section on reproducible workflow (https://edav.info/project.html#reproducible-workflow), which summarizes principles that we’ve discussed in class.
[2 points]
Amaury Sudrie (UNI : as5961) Anjani Prasad Atluri (UNI : aa4462) Maxime Tchibozo (UNI : mt3390) Thomas Causero (UNI : tc3030)
Amaury Sudrie and Thomas Causero: Data collection of the main dataset from TMDB, data cleaning of the main dataset, making the necessary plots for our 3rd major question, generating the interactive shiny app, maintaining the GitHub repository.
Anjani Prasad Atluri and Maxime Tchibozo: Data collection, cleaning and augmentation for our auxiliary dataset, NA analysis on datasets, making the necessary plots for 1st and 2nd major questions.
[6 points]
List three questions that you hope you will be able to answer from your research.
We will be testing our assumption that movie production houses tend to look at old patterns of movie genres and their financial success and try to make more movies of similar genre to earn high revenues. We do this by making a scatter plot between no of movies of a particular genre in a taken year and the revenue generated by the movies of this genre from the previous years. For example with the assumption that a movie takes 2 years to make on an average, we will take the number of movies made in each major genre in 2018 (on y-axis) and plot them against the revenue of these genres from the year 2016 (on x-axis). If there is a strong correlation, it means that since the movies belonging to these genres have performed well financially in 2016, more of such movies were made in 2018.
We see that nowadays many of the yesteryear’s famous directors (like Martin Scorsese) are coming out and speaking up against the monopoly of super hero films over the theaters. We will use our superhero dataset see the trend in the ratio of no of theaters allocated to the superhero movies over the years. We will also test to see if these movies are indeed like the “themepark” (full of spectacle and poor quality) movies that these famous directors claim them to be. We would also like to see the relation between their budget and revenue and their release date and revenue (we assume that the newer superhero movies get more revenue due to the superhero bubble rather than actual quality of the film), and their quality (critics score) and revenue. If we find a strong positive correlation between the quality and revenue, we can reject our assumtion that the huge amounts of revenue they’re generating is just due to the superhero movie bubble and not due to their actual quality.
We want to see how the bankable directors of the older generations have performed over the years leading up to 2019. We define the bankability of a director based on the revenue generated by his movies. We take the top 5 bankable directors of the deacade from 1980 to 1990 and see if their movies made it to the top 250 highest grossing movies list of other years leading upto 2019 (for our dataset we have collected the top 250 higest grossing movies per year from the years 1980 to 2019 from TMDB database), if their movies made it, we will draw a time series graph of their revenue. Through this we can see the financial performance of their movies over the years, and see if they are still bankable or not.
[2 points]
(You don’t have to use the same format for this assignment – PSet 5, part A – and the final project itself.)
Choices are:
pdf_document
html_document
bookdown book: https://bookdown.org/yihui/bookdown/
shiny app: https://shiny.rstudio.com/
(Remember that it’s ok to have pieces of the project that don’t fit into the chosen output format; in those cases you can provide links to the relevant material.)
We are planning to submit the project in a html file. We will have an interactive parallel co-ordinate plot in our final submission, so it would be helpful to make the final submission as a html file.
What is your data source? What is your method for importing data? Please be specific. Provide relevant information such as any obstacles you’re encountering and what you plan to do to overcome them.
[5 points]
We are collecting our data from multiple sources, and augmenting it from a few other sources. We first used “tmdb simple” which is a python wrapper for “The movie Database” API to collect data on top 250 highest revenue generating movies from the years 1980 to 2019. The data includes but is not limited to the release date, rating, director, revenue, budget etc. The major reason we selected TMDB database instead of IMDB is because IMDB database has the genres of the movies listed in alphabetical order, but TMDB listed them in the order of relevance to the movie. We initially faced issues with the rate limit of the API, which is 4 queries per second, to resolve this issue we set some waiting time after a couple of queries. We then stored this data in a csv file and imported into R and stored it as a dataframe.
For our analysis we needed an auxilliary dataset of superhero movies, we created our own dataset for this. The list of the superhero movies and the links to their corresonding imdb, rotten tomatoes and box office mojo pages were collected from wikipedia. The major problem with this was that not all movies had these links. Since there are not many superhero movies that lacked these links, we manually added them. After getting these links we scraped these pages to get details like release date, revenue, budget, no of theaters allocated to these movie etc. For helping our analysis with superhero dataset we also needed the details like no of theaters that were present in the US at the time of the release of these movies, and the average ticket price in the year of the release of these movies. This data was obtained from NATO (National Theater Owners Association).
[10 points]
Obseravtion 1: NA analysis
library(dplyr)
df=read.csv("clean_superhero.csv")
pr=colSums(is.na(df))/nrow(df)*100
sort(pr,decreasing = TRUE)
## opening_weekend_usa theaters gross_usa
## 6.896552 5.747126 5.747126
## gross_worldwide budget name
## 5.747126 2.298851 0.000000
## imdb rt bm
## 0.000000 0.000000 0.000000
## imdb_rating release_date critic_score_rt
## 0.000000 0.000000 0.000000
## audience_score_rt studio year
## 0.000000 0.000000 0.000000
We can see from our superhero movies dataset that we don’t have a lot of missing values in our dataset, and most of the missing values are in opening weekend gross of movies. The theaters column has the no of theaters that were allocated to the movie. gross_usa and the gross_worldwide columns show the gross amount made by the movie in the US and worldwide. The budget column gives the budget of the movie.
library(extracat)
visna(df, sort='b')
We can see form the plot that most of the rows had all the values, in the rows having missing values, the most common pattern is missing the worldwide gross, US gross, no of theaters and opening weekend. We can see that whenever budget is missing worldwide gross, USA gross, theater allocation, and opening weekend collection details are missing.
Whenever the domestic gross is missing the worldwide gross and the opening weekend collection are missing (which makes sense as without data on opening weekend collection we can’t calculate the domestic gross and without domestic gross we can’t calculate the worldwide gross). Whenever the theater allocation details are missing the opening weekend data is also missing.
Extracting release years from the release dates of the movies:
df1=df %>% mutate(year = substr(release_date, nchar(as.character(release_date))-10+1, nchar(as.character(release_date))))
df1 = df1 %>% mutate(year_of_release = substr(year, 1,4))
df1$year_of_release=as.numeric(df1$year_of_release)
Let us now see if there is any correlation between the release year, imdb score and the no of NAs in the row of the movie.
library(GGally)
df1=df1 %>% mutate(NAs=rowSums(is.na(df1)))
df2 = select(df1,year_of_release,imdb_rating,NAs)
#taking only the movies that have NA values in their rows
df3=filter(df2,df2$NAs!=0)
ggpairs(df3)
We can see that there is a strong negative correlation between the year of release and the no of NA values for the movies, (and also the imdb ratings and the no of NA values) that have missing values. It strengthens our suspision that the missing budget, revenue, theater data etc are because of the lack of records for the old superhero movies, or lack of records for movies of poor quality (most of the poorly rated movies are the old ones).
Martin Scorsese’s claim was that these superhero movies are like amuement parks that provide joy but cannot be called cinema. On interpreting this claim, we came up with a measuring metric for these kinds of movies, these movies tend to have not so good critics score (below 60 on rotten tomatoes), high-budget (above 100 million) and worldwide gross above 500 million dollars. To see this we first draw an interactive parallel Co-ordinate plot.
As we can see that there indeed exist such movies in the superhero dataset that are of “Amuzement park” type, let us get into our analysis.
Observation 2: We will be seeing the gross collections generated by these movies in the US over the years of their releases:
gross_year <- data.frame(release_date = df1$year_of_release, gross_usa = df1$gross_usa)
results <- data.frame(gross_year %>%
group_by(release_date) %>%
summarise(total_gross_usa = sum(as.double(gross_usa), na.rm = TRUE)))
results <- results[order(results['release_date']),]
results=na.omit(results)
ggplot(results, aes(release_date, y=total_gross_usa)) +
geom_line() +
geom_point() +
ggtitle("Total gross in USA over the years") +
xlab("Year") +
ylab("Gross in USA")
We can see from the overall trend that the total gross collections generated by superhero movies in the domestic market has been increasing, from the magnitude of these sharp increases we can see that the increase in gross revenue in US is not just because of inflation. We can see very sharp increase from 2015 to 2018, one of the major reasons for this is the more no of superhero movies made in these years. We can see sharp dips in the gross collection in the years 2015, 2010 and 2009. To see why this happended let us make a bar graph to see no of superhero movies that were released during these years.
ggplot(gross_year, aes(release_date)) +
geom_bar() +
ggtitle("Volume of Superhero movies over the years") +
xlab("Year") +
ylab("Volume of movies")
Six superhero movies released in 2018, and 6 in 2019, but the dip in the revenue generated in 2019 is due to the fact that the movie “Joker” hasn’t finished its full run yet and we only have its domestic gross till the time we collected the data, this brought down the average gross of 2019. We can see that 3 superhero movies were released in 2015 in contrast to the years before and after 2015 (which had more no of superhero movies), this explains the gross collections dip in 2015. The same can be said about 2009 and 2010. From the overall trend in the graph we can see that no of superhero movies being made over the years has been increasing.
Since total gross over the years is a metric that depends on the no of movies that released in that year, let us see that total average gross of the superhero movies over the years.
results1 <- data.frame(gross_year %>%
group_by(release_date) %>%
summarise(avg_gross_usa = mean(gross_usa, na.rm = TRUE)))
results1 <- results1[order(results1['release_date']),]
results1=na.omit(results1)
ggplot(results1, aes(release_date, y=avg_gross_usa)) +
geom_line() +
geom_point() +
ggtitle("Average gross in USA over the years") +
xlab("Year") +
ylab("Gross in USA")
In 1978 Superman was released, and from there on other superhero movies were released but they couldn’t perform as well as Superman of 1978 till Batman Forever in 1995. Some movies that came in the middle (1978-1995) are Supergirl in 1984, Superman IV in 1987. Batman in 1989 started to revive the revenues of the superhero movies again. The peak in 2002 is due to the movies “Blade” and “Spiderman” both of which are character from Marvel comics. The dip in 2005 was due to the movies Batman Begins, Constantine and Electra, Eventhough Batman Begins performed well the revenue generated in that year is brought down by the poor performance of Constantine and Electra (both are characters from Marvel comics). The same goes for 2010, where iron man 2 performed well but the average is brought down due to the movie Jonah Hex. The peak in 2013 is due to a slew of superhero movies namely Iron Man 3, Man of Steel, The Wolverine and Thor: The Dark World (All of which are Marvel movies). We can see that the years that generated high revenue saw the releases of relatively well known superheros and have gotten good revenue if the movie of good qulity (high ratings). But to verify this claim, we will do a scatter plot between the ratings (critics and users) of the movie and the revenue it generated to see if the movies have gotten good revenue due to the superhero bubble or due to good quality content.
While we are at it let us see the box plots of revenue and ratings (critics) and audience differentiated on the movie studio that produced them, (general consensus is that Marvel movies have better median score and revenue than their DC counterparts). Through this we can also see if there are any outliers in the data.
One can still argue that we haven’t considered the inflation, when we did the revenue analysis. So we will see how many times on an average the superhero movies are being watched (in the US) over the years. To do this we will divide the gross revenue generated by a movie with the average ticket price during the year in which the movie was released, this will give us the average no of times that particular movie was viewed, then we group by year and get the mean of average no of times all the superhero movies released in that year were watched.
Looking at the plot we can see that when accounted for inflation Batman which was released in 1989 got the highest no of audience and Superman which was released in 1978 got almost similar no of views. It has again picked up in 1998 (Blade I) till 2002 (during which year, the first Spider-man movies was released, and Blade II was released). The movies released in 2012 also got almost similar amout of views as the ones in 2002, 4 movies were released in 2012 and the major movies that caused this spike in viewership are The Avengers and the Dark Knight Rises. We can see that there has been a constant rise in viewership from 2015 to 2018. So we can see that from the past 5 years there has been a constant increase in the viewership of the super-hero movies, but their viwership was not as high as the first Batman and superman movies, it can be due to the intial excitement of comic fans to see their favorite characters come live on the screen.
Since we have seen that the trend in viewership for the superhero movies has indeed been rising from the past 5 years, let use see the trend in the theater allocation. To do this we will merge the no of theaters details present in a year in the US (obtained from the NATO website), with our superhero dataset by year. Then we will get the ratio of no of theater allocated per each movie by dividing the no of theaters allocated for this movie by the total no of theaters in the US in the year this movie released. We then group by the year and plot the average no of theaters allocated for super-hero movies over the years. Unfortunately we only have theater count in US details from the year 1995, so we will lose 11 movies that came before 1995 after the merging of both the datasets.
theaters=read.csv("theaters_in_us.csv")
#Making the comma seperated factor values into numeric
theaters[,2:4] <- lapply(theaters[,2:4],function(x){as.numeric(gsub(",", "", x))})
#Changing the name of the column to facilitate the merge
colnames(theaters)[which(names(theaters) == "Theaters")] <- "year_of_release"
dff2 <- df1 %>% right_join(theaters, by=c("year_of_release"))
#removing movies that lack theater details
dff3=dff2[!is.na(dff2$theaters),]
dff3= mutate(dff3, theater_ratio=theaters/Total)
results3=data.frame(dff3 %>%
group_by(year_of_release) %>%
summarise(avg_theaters = mean(theater_ratio, na.rm = TRUE)))
ggplot(results3, aes(year_of_release, y=avg_theaters)) +
geom_line() +
geom_point() +
ggtitle("Average ratio of theaters allocated for superhero movies in USA over the years") +
xlab("Year") +
ylab("average ratio of theaters allocated in USA")
We can see that the average ratio of no theaters allocated to these movies with the no of theaters present in those years has been increasing overall. There was a drop in the average no of theaters allocated in the year 1997, 2 movies namely, Batman and Robin and Steel released that year. Batman and Robin had 0.3933 as the ratio of no of theaters it was allocated (which is comparable to the previous year) but the average ratio is brought down because the movie “Steel”, as it was not based on a very famous superhero, it was not allocated a lot of theaters. The small downs in the trend are mostly due to lesser no of theaters being allocated to not so famous superhero characters like “catwoman” etc. We can clearly see an increasing trend in the ratio of no theaters that are taking superhero movies, and in the recent years we can see that over 70% of the theaters in the US are playing a superhero should one release during that time. This clearly proves the claim made by Mr. Scorsese that these movies are taking up more no of theaters making it hard for the smaller movies released during the same time to even get some audience.
library(ggplot2)
library(gganimate)
dff4 = data.frame(dff3 %>% group_by(studio,year_of_release) %>% summarise(avg_ratio = mean(theater_ratio, na.rm = TRUE)))
goo <- dff4 %>%
ggplot( aes(x=year_of_release, y=avg_ratio, group=studio, color=studio)) +
geom_line() +
geom_point() +
ggtitle("Popularity of American names in the previous 30 years") +
ylab("Number of babies born") + transition_reveal(year_of_release) + ease_aes('linear')
animate(goo, duration = 10, fps = 20, width = 800, height = 800, renderer = gifski_renderer())
anim_save("output.gif")
Now let us see if there is any correlation between the revenue generated by a movie and the no of theaters it was allocated. You can hover over the points to see what movies those points represent.
library(plotly)
x <- list(
title = "No of theaters allocated"
)
y <- list(
title = "Gross Domestic Revenue in Millions")
p <- plot_ly(dff3, x = ~theaters, y = ~gross_usa, type = 'scatter', mode = 'markers',
text = ~paste('name: ', name)) %>% layout(xaxis = x, yaxis = y, title="Theaters allocated v/s gross domestic revenue") %>% add_lines(y = ~fitted(loess(gross_usa ~ theaters)),
line = list(color = 'rgba(7, 164, 181, 1)'),
name = "Loess Smoother")
ggplotly(p)
Though it is not very strong, there is a correlation between no of theaters allocated for the movie and the revenue it generated. This means that other movies that are not getting enough movie theaters beacuase of superhero movies, are indeed losing on some revenue and in turn are not getting the viewership they deserve. We can also see that after 3500 theaters, there is a fast rise in the gross domestic revenue obtained by the movie.
Let us now look at how much of the domestic revenue of a superhero movie is obtained on its opening weekend. To do this, we will divide the domestic revenue obtained in the first week by the domestic revenue generated by the movie in its entire run-time. This gives us what part of the overall revenue of these movies is obtained in the first week itself. We will then use bar plot to see what part of the domestic revenue is obtained in the first week itself for most of these movies.
#removing rows with NAs in the opening weekend and USA gross columns
df=na.omit(df, cols = c("opening_weekend_usa", "gross_usa"))
df=mutate(df, r=round(opening_weekend_usa/gross_usa, digits=1))
ggplot(df, aes(r)) +
geom_bar() +
ggtitle("Part of domestic revenue obtained within the first week") + theme(plot.title = element_text(hjust = 0.5))+
xlab("ratio of opening week domestic gross to overall domestic gross") +
ylab("No of movies")
From this bar plot we can see that most of the movies (nearly 68% of the movies in the dataset) get more than 40% of their domestic revenue within the first week of their runtime itself. This sparks a possible solution to the theater allotment issue. If a theater runs 4 shows of a movie per a day, and if it is plannnig on showcasing a superhero movie, it can showcase the superhero movie in all the 4 shows for the first week and from the second week can showcase the superhero movie in 2 of its prime slots and another movie in the rest of the slots. This would help other movies (that released at the same time as the superhero movies) get the viewership they deserve.
To see why they’re being made more by the production companies, we will have to look at the return on investment of these superhero movies v/s the normal top 100 highest grossing movies from our main dataset.
Note: We are considering all the movies that are based on Marvel comics (which includes, the movies like Blade, Constantine and Howard the duck).